Challenge 1.2
Challenge 1.3
Challenge 1.4
data_raw directory?## load the tidyverse packages, incl. dplyr
library(tidyverse)
When you load tidyverse:
- dplyr: collection of tools to easily manipulate the data
- tidyr: reshaping the data for plotting
- ggplot2: for advanced plotting
- tibble: modern dataframe
dplyr
cheatsheet
tidyr
cheatsheet
download.file(
url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/Hawks.csv",
destfile = "data_raw/Hawks.csv"
)
read_csv from tidyverse package == read.csv from base
R.
# silently read in CSV file with data
hawks <- read_csv("data_raw/Hawks.csv") #, col_names = T)
## Rows: 908 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): CaptureTime, BandNumber, Species, Age, Sex
## dbl (13): Month, Day, Year, Wing, Weight, Culmen, Hallux, Tail, StandardTai...
## time (1): ReleaseTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hawks)
## spc_tbl_ [908 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Month : num [1:908] 9 9 9 9 9 9 9 9 9 9 ...
## $ Day : num [1:908] 19 22 23 23 27 28 28 29 29 30 ...
## $ Year : num [1:908] 1992 1992 1992 1992 1992 ...
## $ CaptureTime : chr [1:908] "13:30" "10:30" "12:45" "10:50" ...
## $ ReleaseTime : 'hms' num [1:908] NA NA NA NA ...
## ..- attr(*, "units")= chr "secs"
## $ BandNumber : chr [1:908] "877-76317" "877-76318" "877-76319" "745-49508" ...
## $ Species : chr [1:908] "RT" "RT" "RT" "CH" ...
## $ Age : chr [1:908] "I" "I" "I" "I" ...
## $ Sex : chr [1:908] NA NA NA "F" ...
## $ Wing : num [1:908] 385 376 381 265 205 412 370 375 412 405 ...
## $ Weight : num [1:908] 920 930 990 470 170 1090 960 855 1210 1120 ...
## $ Culmen : num [1:908] 25.7 NA 26.7 18.7 12.5 28.5 25.3 27.2 29.3 26 ...
## $ Hallux : num [1:908] 30.1 NA 31.3 23.5 14.3 32.2 30.1 30 31.3 30.2 ...
## $ Tail : num [1:908] 219 221 235 220 157 230 212 243 210 238 ...
## $ StandardTail: num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ Tarsus : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ WingPitFat : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ KeelFat : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ Crop : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. Month = col_double(),
## .. Day = col_double(),
## .. Year = col_double(),
## .. CaptureTime = col_character(),
## .. ReleaseTime = col_time(format = ""),
## .. BandNumber = col_character(),
## .. Species = col_character(),
## .. Age = col_character(),
## .. Sex = col_character(),
## .. Wing = col_double(),
## .. Weight = col_double(),
## .. Culmen = col_double(),
## .. Hallux = col_double(),
## .. Tail = col_double(),
## .. StandardTail = col_double(),
## .. Tarsus = col_double(),
## .. WingPitFat = col_double(),
## .. KeelFat = col_double(),
## .. Crop = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
When you execute read_csv on a data file, it looks
through the first 1000 rows of each column and guesses its data type.
Use the function spec() to view the full column
specification:
columns = vector (datatype)
view(hawks)
hawks
## # A tibble: 908 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA>
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA>
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA>
## 4 9 23 1992 10:50 NA 745-49508 CH I F
## 5 9 27 1992 11:15 NA 1253-98801 SS I F
## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA>
## 7 9 28 1992 13:30 NA 877-76320 RT I <NA>
## 8 9 29 1992 11:45 NA 877-76321 RT A <NA>
## 9 9 29 1992 15:35 NA 877-76322 RT A <NA>
## 10 9 30 1992 13:45 NA 1207-55911 RT I <NA>
## # ℹ 898 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>
head(hawks)
## # A tibble: 6 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA> 385
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA> 376
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA> 381
## 4 9 23 1992 10:50 NA 745-49508 CH I F 265
## 5 9 27 1992 11:15 NA 1253-98801 SS I F 205
## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA> 412
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
tail(hawks)
## # A tibble: 6 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 11 18 2003 14:07 NA 1207-53144 RT I <NA> 366
## 2 11 18 2003 14:44 NA 1177-04777 RT I <NA> 380
## 3 11 19 2003 10:18 NA 803-05985 SS I F 190
## 4 11 19 2003 12:02 NA 1807-53145 RT I <NA> 360
## 5 11 20 2003 9:56 NA 1177-04778 RT I <NA> 369
## 6 11 20 2003 13:30 NA 1207-53145 RT A <NA> 199
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
str(hawks)#structure
## spc_tbl_ [908 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Month : num [1:908] 9 9 9 9 9 9 9 9 9 9 ...
## $ Day : num [1:908] 19 22 23 23 27 28 28 29 29 30 ...
## $ Year : num [1:908] 1992 1992 1992 1992 1992 ...
## $ CaptureTime : chr [1:908] "13:30" "10:30" "12:45" "10:50" ...
## $ ReleaseTime : 'hms' num [1:908] NA NA NA NA ...
## ..- attr(*, "units")= chr "secs"
## $ BandNumber : chr [1:908] "877-76317" "877-76318" "877-76319" "745-49508" ...
## $ Species : chr [1:908] "RT" "RT" "RT" "CH" ...
## $ Age : chr [1:908] "I" "I" "I" "I" ...
## $ Sex : chr [1:908] NA NA NA "F" ...
## $ Wing : num [1:908] 385 376 381 265 205 412 370 375 412 405 ...
## $ Weight : num [1:908] 920 930 990 470 170 1090 960 855 1210 1120 ...
## $ Culmen : num [1:908] 25.7 NA 26.7 18.7 12.5 28.5 25.3 27.2 29.3 26 ...
## $ Hallux : num [1:908] 30.1 NA 31.3 23.5 14.3 32.2 30.1 30 31.3 30.2 ...
## $ Tail : num [1:908] 219 221 235 220 157 230 212 243 210 238 ...
## $ StandardTail: num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ Tarsus : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ WingPitFat : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ KeelFat : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## $ Crop : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. Month = col_double(),
## .. Day = col_double(),
## .. Year = col_double(),
## .. CaptureTime = col_character(),
## .. ReleaseTime = col_time(format = ""),
## .. BandNumber = col_character(),
## .. Species = col_character(),
## .. Age = col_character(),
## .. Sex = col_character(),
## .. Wing = col_double(),
## .. Weight = col_double(),
## .. Culmen = col_double(),
## .. Hallux = col_double(),
## .. Tail = col_double(),
## .. StandardTail = col_double(),
## .. Tarsus = col_double(),
## .. WingPitFat = col_double(),
## .. KeelFat = col_double(),
## .. Crop = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
spec(hawks)
## cols(
## Month = col_double(),
## Day = col_double(),
## Year = col_double(),
## CaptureTime = col_character(),
## ReleaseTime = col_time(format = ""),
## BandNumber = col_character(),
## Species = col_character(),
## Age = col_character(),
## Sex = col_character(),
## Wing = col_double(),
## Weight = col_double(),
## Culmen = col_double(),
## Hallux = col_double(),
## Tail = col_double(),
## StandardTail = col_double(),
## Tarsus = col_double(),
## WingPitFat = col_double(),
## KeelFat = col_double(),
## Crop = col_double()
## )
summary(hawks)
## Month Day Year CaptureTime
## Min. : 8.000 Min. : 1.00 Min. :1992 Length:908
## 1st Qu.: 9.000 1st Qu.: 9.00 1st Qu.:1995 Class :character
## Median :10.000 Median :16.00 Median :1999 Mode :character
## Mean : 9.843 Mean :15.74 Mean :1998
## 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:2001
## Max. :11.000 Max. :31.00 Max. :2003
##
## ReleaseTime BandNumber Species Age
## Length:908 Length:908 Length:908 Length:908
## Class1:hms Class :character Class :character Class :character
## Class2:difftime Mode :character Mode :character Mode :character
## Mode :numeric
##
##
##
## Sex Wing Weight Culmen
## Length:908 Min. : 37.2 Min. : 56.0 Min. : 8.6
## Class :character 1st Qu.:202.0 1st Qu.: 185.0 1st Qu.:12.8
## Mode :character Median :370.0 Median : 970.0 Median :25.5
## Mean :315.6 Mean : 772.1 Mean :21.8
## 3rd Qu.:390.0 3rd Qu.:1120.0 3rd Qu.:27.3
## Max. :480.0 Max. :2030.0 Max. :39.2
## NA's :1 NA's :10 NA's :7
## Hallux Tail StandardTail Tarsus
## Min. : 9.50 Min. :119.0 Min. :115.0 Min. :24.70
## 1st Qu.: 15.10 1st Qu.:160.0 1st Qu.:162.0 1st Qu.:55.60
## Median : 29.40 Median :214.0 Median :215.0 Median :79.30
## Mean : 26.41 Mean :198.8 Mean :199.2 Mean :71.95
## 3rd Qu.: 31.40 3rd Qu.:225.0 3rd Qu.:226.0 3rd Qu.:87.00
## Max. :341.40 Max. :288.0 Max. :335.0 Max. :94.00
## NA's :6 NA's :337 NA's :833
## WingPitFat KeelFat Crop
## Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000
## Median :1.0000 Median :2.000 Median :0.0000
## Mean :0.7922 Mean :2.184 Mean :0.2345
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:0.2500
## Max. :3.0000 Max. :4.000 Max. :5.0000
## NA's :831 NA's :341 NA's :343
nrow(hawks)
## [1] 908
ncol(hawks)
## [1] 19
dim(hawks)
## [1] 908 19
hawks$CaptureTimeptureTime[!is.character(hawks$CaptureTime)]
## Warning: Unknown or uninitialised column: `CaptureTimeptureTime`.
## NULL
# first element in the first column of the data frame
hawks[1, 1]
## # A tibble: 1 × 1
## Month
## <dbl>
## 1 9
# first element in the 6th column
hawks[1, 6]
## # A tibble: 1 × 1
## BandNumber
## <chr>
## 1 877-76317
# first column of the data frame
hawks[, 1]
## # A tibble: 908 × 1
## Month
## <dbl>
## 1 9
## 2 9
## 3 9
## 4 9
## 5 9
## 6 9
## 7 9
## 8 9
## 9 9
## 10 9
## # ℹ 898 more rows
# first column of the data frame
hawks[1]
## # A tibble: 908 × 1
## Month
## <dbl>
## 1 9
## 2 9
## 3 9
## 4 9
## 5 9
## 6 9
## 7 9
## 8 9
## 9 9
## 10 9
## # ℹ 898 more rows
# first three rows of the 6th column
hawks[1:3, 6]
## # A tibble: 3 × 1
## BandNumber
## <chr>
## 1 877-76317
## 2 877-76318
## 3 877-76319
# the 3rd row of the data frame
hawks[3, ]
## # A tibble: 1 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 23 1992 12:45 NA 877-76319 RT I <NA> 381
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
# equivalent to head_hawks <- head(hawks)
head_hawks <- hawks[1:6, ]
# dropping column
# select columns by name
hawks$Month[1:6]
## [1] 9 9 9 9 9 9
hawks[,'Month']
## # A tibble: 908 × 1
## Month
## <dbl>
## 1 9
## 2 9
## 3 9
## 4 9
## 5 9
## 6 9
## 7 9
## 8 9
## 9 9
## 10 9
## # ℹ 898 more rows
hawks[1,c('Year', "ReleaseTime", 'Sex')]
## # A tibble: 1 × 3
## Year ReleaseTime Sex
## <dbl> <time> <chr>
## 1 1992 NA <NA>
sum(!is.na(hawks$Sex))
## [1] 332
hawks_known_sex <- hawks[!is.na(hawks$Sex),]
dim(hawks_known_sex)
## [1] 332 19
Challenge 2.2
Categorical data
levels(hawks$Sex)
## NULL
hawks$Sex <- factor(x = hawks$Sex, levels = c('M', 'F'), labels = c('Male', 'Female'))
levels(hawks$Sex)
## [1] "Male" "Female"
summary(hawks$Sex)
## Male Female NA's
## 158 174 576
Challenge 2.3
plot(hawks$Sex)
sex <- hawks$Sex
levels(sex)
## [1] "Male" "Female"
sex <- addNA(sex)
levels(sex)
## [1] "Male" "Female" NA
levels(sex)[3] <- 'Unknown'
'Unknown1' -> levels(sex)[3]
plot(sex)
Challenge 2.4
x <- 1:10
exp_x <- exp(x)
plot(exp_x, type = 'l', xlab = 'time', ylab = 'progress')
lines(x * 500, col = 'red')
hawks <- read_csv("data_raw/Hawks.csv") #, col_names = T)
## Rows: 908 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): CaptureTime, BandNumber, Species, Age, Sex
## dbl (13): Month, Day, Year, Wing, Weight, Culmen, Hallux, Tail, StandardTai...
## time (1): ReleaseTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
select(hawks, Species, Sex, Weight)
## # A tibble: 908 × 3
## Species Sex Weight
## <chr> <chr> <dbl>
## 1 RT <NA> 920
## 2 RT <NA> 930
## 3 RT <NA> 990
## 4 CH F 470
## 5 SS F 170
## 6 RT <NA> 1090
## 7 RT <NA> 960
## 8 RT <NA> 855
## 9 RT <NA> 1210
## 10 RT <NA> 1120
## # ℹ 898 more rows
select(hawks, -BandNumber, -Culmen)
## # A tibble: 908 × 17
## Month Day Year CaptureTime ReleaseTime Species Age Sex Wing Weight
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <dbl> <dbl>
## 1 9 19 1992 13:30 NA RT I <NA> 385 920
## 2 9 22 1992 10:30 NA RT I <NA> 376 930
## 3 9 23 1992 12:45 NA RT I <NA> 381 990
## 4 9 23 1992 10:50 NA CH I F 265 470
## 5 9 27 1992 11:15 NA SS I F 205 170
## 6 9 28 1992 11:25 NA RT I <NA> 412 1090
## 7 9 28 1992 13:30 NA RT I <NA> 370 960
## 8 9 29 1992 11:45 NA RT A <NA> 375 855
## 9 9 29 1992 15:35 NA RT A <NA> 412 1210
## 10 9 30 1992 13:45 NA RT I <NA> 405 1120
## # ℹ 898 more rows
## # ℹ 7 more variables: Hallux <dbl>, Tail <dbl>, StandardTail <dbl>,
## # Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>, Crop <dbl>
filter(hawks, Sex == "F")
## # A tibble: 174 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 23 1992 10:50 NA 745-49508 CH I F
## 2 9 27 1992 11:15 NA 1253-98801 SS I F
## 3 10 27 1992 10:05 NA 1253-98802 SS I F
## 4 9 29 1993 10:25 NA 1253-98803 SS I F
## 5 10 1 1993 10:20 NA 745-49512 CH I F
## 6 10 12 1993 13:15 NA 745-49515 CH I F
## 7 10 14 1993 14:05 NA 1373-35272 SS A F
## 8 9 8 1994 12:10 NA 1423-16201 SS I F
## 9 9 9 1994 9:02 NA 2003-58433 SS I F
## 10 9 20 1994 9:05 NA 2003-58435 SS A F
## # ℹ 164 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>
We can also filter rows that do not contain missing data in some columns:
filter(hawks, !is.na(Sex) & !is.na(Weight))
## # A tibble: 327 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 23 1992 10:50 NA 745-49508 CH I F
## 2 9 27 1992 11:15 NA 1253-98801 SS I F
## 3 10 23 1992 16:05 NA 1173-19901 SS I M
## 4 10 27 1992 10:05 NA 1253-98802 SS I F
## 5 9 13 1993 14:25 NA 173-19904 SS I M
## 6 9 17 1993 15:25 NA 193-19905 SS I M
## 7 9 29 1993 10:25 NA 1253-98803 SS I F
## 8 10 1 1993 10:20 NA 745-49512 CH I F
## 9 10 1 1993 10:45 NA 745-49513 CH A M
## 10 10 11 1993 11:35 NA 1173-19906 SS I M
## # ℹ 317 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>
This will return all rows that have a value in both the
Sex column and the Weight column. In
Tidyverse, there is also a special
functions drop_na that can be used to filter out rows with
missing data:
drop_na(hawks, Sex, Weight)
## # A tibble: 327 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 23 1992 10:50 NA 745-49508 CH I F
## 2 9 27 1992 11:15 NA 1253-98801 SS I F
## 3 10 23 1992 16:05 NA 1173-19901 SS I M
## 4 10 27 1992 10:05 NA 1253-98802 SS I F
## 5 9 13 1993 14:25 NA 173-19904 SS I M
## 6 9 17 1993 15:25 NA 193-19905 SS I M
## 7 9 29 1993 10:25 NA 1253-98803 SS I F
## 8 10 1 1993 10:20 NA 745-49512 CH I F
## 9 10 1 1993 10:45 NA 745-49513 CH A M
## 10 10 11 1993 11:35 NA 1173-19906 SS I M
## # ℹ 317 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>
%>%)Ctrl + Shift + M Cmd + Shift + M
x %>% f ==> f(x) x %>% f(y) ==> f(x,y)
hawks[1:5,]
## # A tibble: 5 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA> 385
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA> 376
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA> 381
## 4 9 23 1992 10:50 NA 745-49508 CH I F 265
## 5 9 27 1992 11:15 NA 1253-98801 SS I F 205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
head(hawks, n = 5)
## # A tibble: 5 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA> 385
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA> 376
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA> 381
## 4 9 23 1992 10:50 NA 745-49508 CH I F 265
## 5 9 27 1992 11:15 NA 1253-98801 SS I F 205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
hawks %>% head(n = 5)
## # A tibble: 5 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA> 385
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA> 376
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA> 381
## 4 9 23 1992 10:50 NA 745-49508 CH I F 265
## 5 9 27 1992 11:15 NA 1253-98801 SS I F 205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
hawks %>% head(n=5) %>% tail(n=1)
## # A tibble: 1 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 27 1992 11:15 NA 1253-98801 SS I F 205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>
hawks %>%
filter(Sex == "F" | Weight > 500)
## # A tibble: 741 × 19
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA>
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA>
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA>
## 4 9 23 1992 10:50 NA 745-49508 CH I F
## 5 9 27 1992 11:15 NA 1253-98801 SS I F
## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA>
## 7 9 28 1992 13:30 NA 877-76320 RT I <NA>
## 8 9 29 1992 11:45 NA 877-76321 RT A <NA>
## 9 9 29 1992 15:35 NA 877-76322 RT A <NA>
## 10 9 30 1992 13:45 NA 1207-55911 RT I <NA>
## # ℹ 731 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>
Challenge 3.1
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use
mutate().
colnames(hawks)
## [1] "Month" "Day" "Year" "CaptureTime" "ReleaseTime"
## [6] "BandNumber" "Species" "Age" "Sex" "Wing"
## [11] "Weight" "Culmen" "Hallux" "Tail" "StandardTail"
## [16] "Tarsus" "WingPitFat" "KeelFat" "Crop"
hawks %>%
mutate(Weight_kg = Weight / 1000)
## # A tibble: 908 × 20
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA>
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA>
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA>
## 4 9 23 1992 10:50 NA 745-49508 CH I F
## 5 9 27 1992 11:15 NA 1253-98801 SS I F
## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA>
## 7 9 28 1992 13:30 NA 877-76320 RT I <NA>
## 8 9 29 1992 11:45 NA 877-76321 RT A <NA>
## 9 9 29 1992 15:35 NA 877-76322 RT A <NA>
## 10 9 30 1992 13:45 NA 1207-55911 RT I <NA>
## # ℹ 898 more rows
## # ℹ 11 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>, Weight_kg <dbl>
You can also create a second new column based on the first new column
within the same call of mutate():
hawks %>%
mutate(Weight_kg = Weight / 1000,
Weight_lb = Weight_kg * 2.2)
## # A tibble: 908 × 21
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA>
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA>
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA>
## 4 9 23 1992 10:50 NA 745-49508 CH I F
## 5 9 27 1992 11:15 NA 1253-98801 SS I F
## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA>
## 7 9 28 1992 13:30 NA 877-76320 RT I <NA>
## 8 9 29 1992 11:45 NA 877-76321 RT A <NA>
## 9 9 29 1992 15:35 NA 877-76322 RT A <NA>
## 10 9 30 1992 13:45 NA 1207-55911 RT I <NA>
## # ℹ 898 more rows
## # ℹ 12 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## # KeelFat <dbl>, Crop <dbl>, Weight_kg <dbl>, Weight_lb <dbl>
The first few rows of the output are full of NAs, so if
we wanted to remove those we could insert a filter() in the
chain:
hawks %>%
filter(!is.na(Weight)) %>%
mutate(Weight_kg = Weight / 1000) %>%
head()
## # A tibble: 6 × 20
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing
## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl>
## 1 9 19 1992 13:30 NA 877-76317 RT I <NA> 385
## 2 9 22 1992 10:30 NA 877-76318 RT I <NA> 376
## 3 9 23 1992 12:45 NA 877-76319 RT I <NA> 381
## 4 9 23 1992 10:50 NA 745-49508 CH I F 265
## 5 9 27 1992 11:15 NA 1253-98801 SS I F 205
## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA> 412
## # ℹ 10 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## # StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## # Crop <dbl>, Weight_kg <dbl>
Challenge 3.2
Many data analysis tasks can be approached using the
split-apply-combine paradigm: split the data into groups, apply
some analysis to each group, and then combine the results.
dplyr makes this very easy through the use
of the group_by() function.
hawks %>%
group_by(Sex) %>%
summarize(mean = mean(Weight, na.rm = TRUE))
## # A tibble: 3 × 2
## Sex mean
## <chr> <dbl>
## 1 F 257.
## 2 M 174.
## 3 <NA> 1090.
You can also group by multiple columns:
hawks %>%
group_by(Species, Sex) %>%
summarize(mean = mean(Weight, na.rm = TRUE))
## # A tibble: 9 × 3
## # Groups: Species [3]
## Species Sex mean
## <chr> <chr> <dbl>
## 1 CH F 490.
## 2 CH M 348.
## 3 CH <NA> 402
## 4 RT F 1147.
## 5 RT M 1080
## 6 RT <NA> 1094.
## 7 SS F 175.
## 8 SS M 119.
## 9 SS <NA> 95
More metrics
hawks %>%
group_by(Species, Sex) %>%
summarize(mean = mean(Weight, na.rm = TRUE),
min = min(Weight, na.rm = TRUE))
## # A tibble: 9 × 4
## # Groups: Species [3]
## Species Sex mean min
## <chr> <chr> <dbl> <dbl>
## 1 CH F 490. 56
## 2 CH M 348. 155
## 3 CH <NA> 402 324
## 4 RT F 1147. 1120
## 5 RT M 1080 1080
## 6 RT <NA> 1094. 101
## 7 SS F 175. 92
## 8 SS M 119. 85
## 9 SS <NA> 95 95
It is sometimes useful to rearrange the result of a query to inspect
the values. For instance, we can sort on min to put the
lowest numbers first:
hawks %>%
group_by(Species, Sex) %>%
summarize(mean = mean(Weight, na.rm = TRUE),
min = min(Weight, na.rm = TRUE)) %>%
arrange(min)
## # A tibble: 9 × 4
## # Groups: Species [3]
## Species Sex mean min
## <chr> <chr> <dbl> <dbl>
## 1 CH F 490. 56
## 2 SS M 119. 85
## 3 SS F 175. 92
## 4 SS <NA> 95 95
## 5 RT <NA> 1094. 101
## 6 CH M 348. 155
## 7 CH <NA> 402 324
## 8 RT M 1080 1080
## 9 RT F 1147. 1120
hawks %>%
group_by(Species, Sex) %>%
summarize(mean = mean(Weight, na.rm = TRUE),
min = min(Weight, na.rm = TRUE)) %>%
arrange(desc(min))
## # A tibble: 9 × 4
## # Groups: Species [3]
## Species Sex mean min
## <chr> <chr> <dbl> <dbl>
## 1 RT F 1147. 1120
## 2 RT M 1080 1080
## 3 CH <NA> 402 324
## 4 CH M 348. 155
## 5 RT <NA> 1094. 101
## 6 SS <NA> 95 95
## 7 SS F 175. 92
## 8 SS M 119. 85
## 9 CH F 490. 56
When working with data, we often want to know the number of
observations found for each factor or combination of factors. For this
task, dplyr provides count().
For example, if we wanted to count the number of rows of data for each
Sex, we would do:
hawks %>%
count(Sex)
## # A tibble: 3 × 2
## Sex n
## <chr> <int>
## 1 F 174
## 2 M 158
## 3 <NA> 576
The count() function is shorthand for something we’ve
already seen: grouping by a variable, and summarizing it by counting the
number of observations in that group. In other words,
hawks %>% count(Sex) is equivalent to:
hawks %>%
group_by(Sex) %>%
summarize(n = n())
## # A tibble: 3 × 2
## Sex n
## <chr> <int>
## 1 F 174
## 2 M 158
## 3 <NA> 576
We can also combine count() with other functions such as
filter(). Here we will count the number of each species
with weights above 800 g.
hawks %>%
filter(Weight > 500) %>%
count(Species)
## # A tibble: 3 × 2
## Species n
## <chr> <int>
## 1 CH 19
## 2 RT 566
## 3 SS 2
The example above shows the use of count() to count the
number of rows/observations for one factor (i.e.,
Species). If we wanted to count combination of
factors, such as Species and Sex, we
would specify the first and the second factor as the arguments of
count():
hawks %>%
filter(Weight > 500) %>%
count(Species, Sex)
## # A tibble: 6 × 3
## Species Sex n
## <chr> <chr> <int>
## 1 CH F 17
## 2 CH M 2
## 3 RT F 3
## 4 RT M 1
## 5 RT <NA> 562
## 6 SS M 2
With the above code, we can proceed with arrange() to
sort the table according to a number of criteria so that we have a
better comparison. For instance, we might want to arrange the table
above in (i) an alphabetical order of the levels of the sex and (ii) in
descending order of the count:
hawks %>%
filter(Weight > 500) %>%
count(Species, Sex) %>%
arrange(Sex, desc(n))
## # A tibble: 6 × 3
## Species Sex n
## <chr> <chr> <int>
## 1 CH F 17
## 2 RT F 3
## 3 CH M 2
## 4 SS M 2
## 5 RT M 1
## 6 RT <NA> 562
Challenge 3.4
# Filter out observations
hawks_rt_f <- hawks %>%
filter(Species == "RT" & Sex == "F")
write_csv(hawks_rt_f, file = "data_processed/Hawks_Red-Tailed_female.csv")